Segmentation of Chinese Long Sentences Using Commas
Abstract
The comma is the most common punctuation mark, and as such it may have the greatest effect on the syntactic analysis of a sentence. Because Chinese is an isolating language, Chinese sentences offer fewer cues for parsing, and the cues for segmenting a long Chinese sentence are fewer still. However, commas are used more frequently on average in Chinese than in other languages, so the comma plays an important role in segmenting long Chinese sentences. This paper proposes a method that classifies the commas in a Chinese sentence by their context and then segments a long sentence according to the classification results. Experimental results show that comma classification accuracy reaches 87.1 percent and that, with our segmentation model, our parser's dependency parsing accuracy improves by 9.6 percent.
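The two-stage idea in the abstract (classify each comma by its context, then cut the sentence at the commas classified as segmentation points) can be illustrated with a minimal sketch. This is not the paper's model: the function names and the length-based rule in `toy_is_boundary` are hypothetical placeholders standing in for a trained, context-based comma classifier.

```python
from typing import Callable, List

COMMA = "\uFF0C"  # full-width Chinese comma


def toy_is_boundary(left: str, right: str, min_len: int = 4) -> bool:
    """Hypothetical stand-in for a trained comma classifier.

    Treats a comma as a segmentation point only when the text on both
    sides is at least `min_len` characters long. A real system would
    use contextual features instead of this length rule.
    """
    return len(left) >= min_len and len(right) >= min_len


def segment_at_commas(
    sentence: str,
    is_boundary: Callable[[str, str], bool] = toy_is_boundary,
) -> List[str]:
    """Split `sentence` at the commas that `is_boundary` accepts."""
    segments: List[str] = []
    start = 0
    for i, ch in enumerate(sentence):
        if ch != COMMA:
            continue
        # Left context: text since the last accepted boundary.
        left = sentence[start:i]
        # Right context: text up to the next comma or the sentence end.
        nxt = sentence.find(COMMA, i + 1)
        right = sentence[i + 1 : nxt if nxt != -1 else len(sentence)]
        if is_boundary(left, right):
            segments.append(sentence[start : i + 1])
            start = i + 1
    if start < len(sentence):
        segments.append(sentence[start:])
    return segments


if __name__ == "__main__":
    example = "我昨天去了北京，天气很好，但是，我没有去长城。"
    for segment in segment_at_commas(example):
        print(segment)
```

Passing the classifier as a callable keeps the segmentation step independent of how commas are classified, so the placeholder rule could be replaced by any learned model that returns a yes/no decision for each comma.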
Similar resources
Chinese sentence segmentation as comma classification
We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detectin...
On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching
This paper addresses two major problems in the closed task of Chinese word segmentation (CWS): tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. To resolve the former, we apply K-means clustering to identify non-Chinese characters, and then adopt a two-tagger architecture: one for Chinese text and the other for non-Chinese text. For the latter problem,...
Sentence Segmentation Using IBM Word Alignment Model 1
In statistical machine translation, word alignment models are trained on bilingual corpora. Long sentences pose severe problems: 1. the high computational requirements; 2. the poor quality of the resulting word alignment. We present a sentence-segmentation method that solves these problems by splitting long sentence pairs. Our approach uses the lexicon information to locate the optimal split po...
Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families
In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considers situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studies the issue of identifying synonymous translation equivalent pairs. First, we collect candidates of synonymous trans...
Two-Phase LMR-RC Tagging for Chinese Word Segmentation
In this paper we present a Two-Phase LMR-RC Tagging scheme to perform Chinese word segmentation. In the Regular Tagging phase, Chinese sentences are processed similar to the original LMR Tagging. Tagged sentences are then passed to the Correctional Tagging phase, in which the sentences are re-tagged using extra information from the first round tagging results. Two training methods, Separated Mo...